Abstract


Global load balancing, if practical, would allow the effective use of
massively-parallel ensemble architectures for large soft-real-time
problems. The challenge is to replace quick global communications, which
are impractical in a massively-parallel system, with statistical
techniques. In this vein, we propose a novel approach to decentralized
load balancing based on statistical time-series analysis. Each site
estimates the system-wide average load using information about past loads
of individual sites and attempts to equal that average. This estimation
process is practical because the soft-real-time systems we are interested
in naturally exhibit loads that are periodic, in a statistical sense akin
to seasonality in econometrics. We show how this load-characterization
technique can be the foundation for a load-balancing system in an
architecture employing cut-through routing and an efficient multicast
protocol.

1 Introduction


Our research group, the Stanford Knowledge Systems Laboratory Advanced
Architectures Project, is exploring the construction of massively-parallel,
object-oriented, knowledge-based, soft-real-time signal-interpretation
systems. It seemed clear early on that some sort of adaptive
load-distribution scheme would be necessary to allocate resources to such
dynamic systems. Otherwise, in order to assure acceptable real-time
performance, the system could only be lightly loaded, and the large-scale
signal-interpretation problems the massive parallelism was intended to
allow would not be possible. The remainder of this section explains why we
desire a scheme which globally balances loads by migrating objects, and how
we can exploit the somewhat periodic nature of our systems' loads to do
global balancing in a manner appropriate to thousands of processing
elements.

Much  discussion  in  the  load-distribution  literature  recently  has  centered 
on  the  choice  of  load  balancing  vs.  load  sharing  [14].  While  load  balancing 
strives  to  keep  all  sites  equally  loaded,  load  sharing  merely  tries  to  prevent 
unnecessary idleness. Load balancing is appropriate to object-oriented
real-time systems because

• real-time systems need to prevent long waits for processing; load
balancing, by reducing the variance as well as the average of waiting
times, better achieves this; also,

• migrating objects to balance current load tends to also balance the
future arrival of additional work at sites.

Traditionally,  decentralized  adaptive  load-balancing  systems  have  been 
local:  they  balance  loads  in  small  neighborhoods  (the  neighborhoods  may 
be  logical,  rather  than  physical),  and  rely  on  repeated  local  adjustments  to 
achieve  global  balance.  (For  a  clear  example,  see  the  descriptions  of  diffusion 
in  [12,13].)  We  find  this  inappropriate  to  our  circumstances  because 

•  modern  interconnection  networks  employing  cut-through  or  wormhole 
routing  reduce  the  importance  of  locality  [7], 

• local techniques can fall prey to oscillation and wave-front-like
propagation in the face of non-ideal conditions, and

•  local  techniques  have  difficulty  responding  quickly  enough  for  dynamic 
and  time-critical  systems. 
A global load-balancing system must somehow allow each site to estimate
the current (or near-future) system-wide total load, in order that it may
acquire or jettison sufficient work to bring its own load to the
system-wide average. This seems incompatible with the constraints of a
massively-parallel system: a site in a massively-parallel system must wait
a considerable time to acquire global knowledge.

This apparent contradiction can be reconciled by using a stochastic
time-series model to predict current loads from prior load information.
However, this approach is useless in most computer systems, as their loads
are not very predictable.

Luckily, the real-time systems we are interested in (and many others)
exhibit a different behavior. Their loads are periodic: not rigidly so, but
rather in the same loose, statistical sense as many economic variables are
seasonal. This periodicity is induced by sampled or scanned inputs and by
sample-to-sample or scan-to-scan consistency in the outside world.
Periodicity makes the loads more predictable, at least for lead times not
greater than the period. As the period is generally relatively long, each
site can have complete knowledge of loads at least through one period ago.
This allows reasonably accurate prediction of current (or near-future)
system-wide loads.

Notice  that  the  statistical  nature  of  this  approach  makes  it  appropriate 
to  massively-parallel  systems  with  thousands  of  processing  elements: 

• The large number of sites makes more straightforward methods employing
global communications impractical.

•  On  the  other  hand,  the  large  number  of  sites  is  necessary  to  make  the 
statistical  methods  valid. 

We are not suggesting this approach for real-time systems which are
rigidly periodic; more direct use can be made of their periodicity. For
example, Yan's "post-game analysis" method [17] could be used to
successively refine a quasi-static mapping.

2  An  Example  Time  Series 

Figure 1: A sample of a load time series.

In this section we examine the evolution over time of the system-wide load
in one of our real-time systems, an aircraft tracking and classification
system [16]. We show that a simple stochastic model reasonably approximates
this time series, that it is consistent with a common-sense understanding
of the system, and that it allows moderately accurate prediction without
recent complete information. Two notes are in order:

•  Only  the  earliest,  simplest,  most  data-driven  stage  of  the  system  was 
operational  when  this  data  was  taken;  this  results  in  a  more  regular 
time  series  than  would  otherwise  be  the  case.  In  particular,  diagnostic 
tests  show  our  model  to  be  incomplete,  in  that  it  misses  a  couple  of 
sub-periods  caused  by  the  structure  of  the  computation.  We  expect 
the  structure  of  a  complete  system  to  be  complex  enough  not  to  show 
through  in  the  load  time  series. 

•  The  plots  in  Figures  1  and  4  below  show  only  a  typical  interval  out  of 
the  larger  time  series  which  was  analyzed. 

Figure  1  shows  the  load  over  ten  periods;  each  period  is  ten  time  quanta 
long,  and  the  load  value  for  each  quantum  is  an  average  total  of  task  queue 
lengths  over  that  quantum.  Notice  that  the  pattern  gradually  shifts  from 
period  to  period.  Also,  notice  that  as  the  observed  activity  diminishes,  the 
system’s  performance  varies  from  not  quite  keeping  up  with  the  input  to 
having a relatively long period of quiescence between cycles. It is
characteristic of real-time systems that they are sized so as to perform
acceptably during peak periods, even if this means idleness at other times;
this allows the periodicity of the input to show through as a periodicity
of the load. The sub-periods referred to above are also visible in the
graph: the coarse sampling and small excerpt obscure them somewhat, but
each major peak is followed by two smaller peaks whose sizes correlate with
each other and with that of the major peak.

2.1  Stochastic  model 

We analyzed this series using the methods of Box and Jenkins [3]¹, and
identified as a suitable first-cut model for it a multiplicative integrated
moving average (IMA) process of orders $(0,1,1) \times (0,1,1)_{10}$. This
model has the form:

$$z_t = z_{t-1} + z_{t-10} - z_{t-11} + a_t - \theta a_{t-1} - \Theta a_{t-10} + \theta\Theta a_{t-11}$$

where $z_t$ is the system-wide load, $a_t$ is a white-noise series, and
$\theta$ and $\Theta$ are parameters. The structure of this process is more
evident when written using the backwards shift operator $B$:

$$(1 - B)(1 - B^{10}) z_t = (1 - \theta B)(1 - \Theta B^{10}) a_t.$$

Adding  the  constraint  that  loads  must  be  non-negative  improves  this  basic 
model. 
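To make the difference equation concrete, the following minimal Python sketch simulates a load series from it. The function name, the start-up pattern, the noise level, and the parameter values used in any call are illustrative assumptions, not values from our experiments. Clipping each value at zero is only one simple way to impose the non-negativity constraint; the text does not prescribe how the constraint is applied.

```python
import random


def simulate_load(n_steps, theta, big_theta, sigma=1.0, period=10,
                  start=None, seed=0):
    """Simulate the multiplicative IMA (0,1,1)x(0,1,1)_period load series.

    z_t = z_{t-1} + z_{t-p} - z_{t-p-1}
          + a_t - theta*a_{t-1} - Theta*a_{t-p} + theta*Theta*a_{t-p-1}
    Values are clipped at zero (an assumed way to keep loads non-negative).
    """
    p = period
    rng = random.Random(seed)
    # Start-up history of at least p+1 loads (hypothetical default pattern).
    z = list(start) if start is not None else [5.0] * (p + 1)
    if len(z) < p + 1:
        raise ValueError("start must contain at least period + 1 values")
    a = [0.0] * len(z)  # shocks before the simulation starts are taken as zero
    for t in range(len(z), len(z) + n_steps):
        a_t = rng.gauss(0.0, sigma)
        z_t = (z[t - 1] + z[t - p] - z[t - p - 1]
               + a_t - theta * a[t - 1] - big_theta * a[t - p]
               + theta * big_theta * a[t - p - 1])
        z.append(max(0.0, z_t))
        a.append(a_t)
    return z
```

With `sigma=0.0` the shocks vanish and the series simply repeats its start-up pattern, which is the sense in which the load "persists" from period to period in the absence of perturbations.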

This model, while suggested by statistical evidence, is also plausible in
terms of the mechanism of the system. The non-periodic component of the
model essentially states that the load persists, except that it is subject
to random perturbations. Some fraction ($\theta$) of each random
perturbation is of short-term effect only, while the remainder lasts until
counteracted; this fits well with a birth-death view of processes. The
periodic component of the model is identical in form, and can be similarly
justified: the aircraft under observation (and thus the load pattern)
remain constant except for random perturbations, some fraction
($1 - \Theta$) of which are long-lasting entries or departures from the
field of observation.

This  model  belongs  to  the  broad  class  of  stochastic  processes  known  as 
ARMA  (autoregressive-moving  average)  processes.  It  is  interesting  to  ask 
why  this  particular  ARMA  process  should  be  chosen — might  others  not  fit 
as  well?  The  answer  is  partially  that  this  is  the  simplest  periodic  ARMA 
process  whose  periodic  and  non-periodic  components  are  both: 

• non-stationary (i.e., they have no fixed level),

• stable (i.e., they don't grow explosively), and

• homogeneous (i.e., everywhere self-similar except for level).

¹ The equations in this section are reproduced with minor changes in
notation from [3].

Naturally a higher-order process could be used, which would fit better.
However, it is generally preferable to use the simplest suitable model.
Another possibility would be to drop the requirement of level independence
by expanding the model to include a stationary autoregressive operator,
i.e. by making it ARIMA (autoregressive-integrated moving average) rather
than merely IMA. It can be argued that a busier system will spawn more
processes, or alternatively that a busier system will run more processes to
completion. We left this component out of our model because

• in a loaded system, the activity is not proportional to the load (as
additional load means additional waiting tasks, rather than additional
running tasks), and

• the statistical evidence does not unambiguously suggest such a
component.

Diagnostic tests, as suggested by Box and Jenkins, showed that the model
was only roughly fitting, due in part to the unmodeled sub-periods. This is
especially evident in the cumulative periodogram of residuals, reproduced
in Figure 2; the bulge around frequency 0.25 (period 4) shows that the
model misses some periodicity in that neighborhood. (A cumulative
periodogram shows an integrated power spectrum. A perfectly fitting model
would leave white-noise residuals with a flat power spectrum and hence a
straight diagonal cumulative periodogram.) Even the cumulative periodogram
of ideal white-noise residuals might, because of the limited sample size,
deviate outside the dashed lines approximately 25% of the time (the limit
lines are calculated from the Kolmogorov-Smirnov test). Therefore, as the
bulge just reaches the 25% limit line, it can't be considered an especially
serious failure of the model. On the other hand, other statistical evidence
and our understanding of the system indicate that the model is genuinely
incomplete, rather than the bulge merely being an artifact of the limited
sample size. We felt that incorporating these sub-periods into the model
would be artificial, however, both because they are an artifact of the
simplicity of the sample system, and also because they are not a priori
known (or necessarily constant), unlike the externally imposed period.

Figure 2: Normalized cumulative periodogram of residuals.


2.2  Forecasting 

The non-periodic component of the model is that which is conventionally
used for aperiodic computer systems; it gives rise to the familiar
exponentially-weighted average forecast function. The periodic component in
effect adds an exponentially-weighted average of corrections to this
forecast, derived from the experience at corresponding points in earlier
periods. Formally, the best one-step-ahead forecast possible for the model
is found by assigning weights $\pi_j$ to the loads $j$ steps earlier, where

$$\pi_j = \theta^{j-1}(1-\theta), \quad j = 1, \ldots, 9$$

$$\pi_{10} = \theta^{9}(1-\theta) + (1-\Theta)$$

$$\pi_{11} = \theta^{10}(1-\theta) - (1-\theta)(1-\Theta)$$

$$\pi_j = \theta\pi_{j-1} + \Theta\pi_{j-10} - \theta\Theta\pi_{j-11}, \quad j \ge 12.$$
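The weight recursion can be checked numerically. The sketch below is a minimal Python rendering of the weights for a general period; the function name and the parameter values in the usage note are hypothetical and chosen only for illustration. It assumes the standard weights of the multiplicative IMA model, which follow from dividing the differencing operator by the moving-average operator.

```python
def forecast_weights(theta, big_theta, n_weights, period=10):
    """Return [pi_1, ..., pi_n], the weights on loads 1..n steps earlier."""
    p = period
    pi = [0.0] * (n_weights + 1)  # pi[0] is unused so indices match the text
    for j in range(1, n_weights + 1):
        if j < p:
            # Exponentially decaying weights within the current period.
            pi[j] = theta ** (j - 1) * (1 - theta)
        elif j == p:
            # Extra weight on the load exactly one period ago.
            pi[j] = theta ** (p - 1) * (1 - theta) + (1 - big_theta)
        elif j == p + 1:
            pi[j] = theta ** p * (1 - theta) - (1 - theta) * (1 - big_theta)
        else:
            # General recursion for older loads.
            pi[j] = (theta * pi[j - 1] + big_theta * pi[j - p]
                     - theta * big_theta * pi[j - p - 1])
    return pi[1:]
```

For example, with the illustrative values `theta=0.8` and `big_theta=0.3`, the weights sum to one (as they must for a model with no fixed level) and the largest single weight falls on the load one period earlier rather than on the most recent load.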

Depending on the relationship between $\theta$ and $\Theta$, the heaviest
weight in the forecast may either be on the most recent value, or on the
one a period ago. In the aircraft tracking case (and many others, we
speculate), there is more consistency from period to period than from
instant to instant (as aircraft are more inertial than processes). This
leads to the weights illustrated in Figure 3, which were computed from the
values for $\theta$ and $\Theta$ that best fit our sample series.

Forecasts can also be computed directly from the difference equation we
used to define the model. In either case, forecasts for greater lead times
can be calculated by repeated use of the step-ahead formula. (By lead time
we mean the time from when the total load is last known to when the
forecast is for.)
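As an illustration of forecasting directly from the difference equation, the following Python sketch first recovers the shocks as one-step-ahead errors over the known history and then applies the step-ahead formula repeatedly, with each future shock replaced by its expected value of zero. The function name is hypothetical, and treating the earliest shocks as zero is a simplifying assumption of this sketch.

```python
def forecast(z, theta, big_theta, lead, period=10):
    """Forecast `lead` steps beyond the end of the known load history z."""
    p = period
    if len(z) < p + 2:
        raise ValueError("need more than one period of history")

    def step(zs, shocks, t):
        # One-step-ahead prediction of z_t from values before time t.
        return (zs[t - 1] + zs[t - p] - zs[t - p - 1]
                - theta * shocks[t - 1] - big_theta * shocks[t - p]
                + theta * big_theta * shocks[t - p - 1])

    # Recover shocks as one-step-ahead errors; the first p+1 are assumed zero.
    a = [0.0] * len(z)
    for t in range(p + 1, len(z)):
        a[t] = z[t] - step(z, a, t)

    # Extend the series, treating each forecast as if it had been observed.
    z_ext, a_ext = list(z), list(a)
    for _ in range(lead):
        z_ext.append(step(z_ext, a_ext, len(z_ext)))
        a_ext.append(0.0)  # expected value of a future shock
    return z_ext[len(z):]
```

A perfectly periodic history yields zero shocks, so the forecast simply continues the pattern; any recent deviation is carried forward in proportion to its persistent fraction.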

Since the period (in this case, the scan time of a radar) is long relative
to the communication latencies of the system, it is reasonable to suppose
that each site can have complete knowledge of all other sites' loads at
least up until one period earlier, with diminishing knowledge thereafter.
It should be possible in principle to make some use of the more recent,
incomplete, information to improve the forecast, given a model of the load
distribution with load balancing. In the next section we address this
problem and show a heuristic solution. However, Figure 4 shows that even
forecasts made using only data up through one period before the forecast
time are usually moderately accurate.

2.3 How typical is this example?

Though this section presented a case study of a single time series taken
from a single application, we believe the basic features are common to
other systems as well. Preliminary results from experimentation with a
passive radar interpretation system [4] confirm this belief. The IMA
$(0,1,1) \times (0,1,1)_p$ model used here may well suit many such systems,
though its suitability should of course be tested in each case. As well as
testing the suitability of the model to a particular application, it is
necessary to tune the parameters using sample time series. Systems with
more than one period, for example from heterogeneous sensors, would
necessitate a straightforward extension of the model.

One  potential  stumbling  block  in  generalizing  this  technique  to  more 
realistic  systems  is  that  higher-level  processing  tends  to  be  triggered  by 
significant  changes  in  the  input  (or  by  the  lack  of  expected  changes),  rather 
than  by  the  input  itself.  For  example,  a  system  that  not  merely  tracks 
aircraft,  but  also  attempts  to  deduce  possible  objectives,  would  reconsider 
the  objective  of  an  aircraft  that  sharply  turned,  or  that  failed  to  turn  when 
it  was  expected  to.  This  reduces  the  scan-to-scan  consistency  of  the  load.  It 
remains  to  be  seen  how  troublesome  this  is;  clearly  this  depends  on  how  much 
of the processing is special-case. When this issue came up in a discussion
with a group familiar with actual systems, the consensus was that the load
on present-day systems is indeed quite periodic [15].

3  Incorporating  Incomplete  Information 

The simple stochastic model presented in the preceding section only allows
load information old enough to be complete (i.e. available from all sites)
to be used. In this section we refine our model to allow incomplete
information (i.e., more recent loads from some sites) to be employed. We
formulate the problem, show an exact but impractical solution, and then
present provably good practical heuristic approximations.

3.1  The  problem 

In order to understand what use a site can make of recent but incomplete
information, we must refine our model to include how the system-wide total
load is divided among the $N$ sites. A simple, plausible version of this is
to assume that the sites are independent instantaneously, but in the longer
term are successfully balanced. Formally, the model we have in mind is

$$z_{i,t} = a_{i,t} + \frac{z_{t-1} + z_{t-10} - z_{t-11} - \theta a_{t-1} - \Theta a_{t-10} + \theta\Theta a_{t-11}}{N},$$

where we use $z_{i,t}$ for the load of site $i$ at time $t$ (with
$z_t = \sum_i z_{i,t}$), and similarly for $a_{i,t}$ and $a_t$ (the
$a_{i,t}$ are independently normally distributed, with variance
$\sigma_a^2$).

As long as all $z_{i,t}$ are known, the $a_{i,t}$ can be calculated, and
thus used for forecasting. When the information is incomplete, the
deviation of the known $z_{i,t}$ from the step-ahead forecasts can no
longer be attributed solely to their corresponding $a_{i,t}$, but rather
will also include the persistent fraction of earlier unknown perturbations.
The problem is to find the expected division between these two sources of
perturbation, as the expected value of each $a_{i,t}$ should be
incorporated into the forecast in its own way.

3.2  Exact  solution 

This  problem  can  be  solved  by  applying  Bayes’s  theorem: 

• We are given as a prior distribution for the $a_{i,t}$ that they are
independently normally distributed with some variance $\sigma_a^2$.

• We make observations which imply a joint likelihood for the $a_{i,t}$
that is uniform where certain linear combinations of them (given below)
equal the known $z_{i,t}$ and zero elsewhere.

• We would like to find the posterior joint distribution of the $a_{i,t}$,
specifically its expected value, for use in forecasting.

The non-zero regions of the likelihood function can be found by rewriting
the equation for $z_{i,t}$ in terms of the $a_{i,t}$ alone, using the
summation operators $S = (1 + SB)$ and $S_{10} = (1 + S_{10}B^{10})$:

$$z_{i,t} = a_{i,t} + \frac{\left( (1-\theta)SB + (1-\Theta)S_{10}B^{10} + (1-\theta)(1-\Theta)SS_{10}B^{11} \right) a_t}{N}.$$

The  posterior  distribution  can  readily  be  written  using  Bayes’s  theorem, 
provided  one  is  willing  to  leave  some  messy  integrals  in  it.  Unfortunately, 
this  leaves  numerical  integration  as  the  only  way  to  find  the  needed  expected 
value.  This  seems  to  be  too  much  work  to  expect  a  load-balancing  system 
to  perform  each  time  interval.  What  is  needed  is  a  pre-posterior  analysis — a 
general  analysis  done  in  advance,  into  which  specific  numbers  can  be  plugged 
at  run  time.  Unfortunately,  we  know  of  no  such  approach  to  this  problem  in 
the  general  case.  In  the  next  subsection  we  consider  heuristic  approximations 
appropriate  to  our  intended  implementation.  The  analysis  above  serves  as 
the  standard  by  which  the  heuristics  are  judged,  as  well  as  suggesting  them. 

3.3  Heuristic  approximations 

The simplest heuristic is to assume that the full deviation of each known
load $z_{i,t}$ from its step-ahead forecast is purely its corresponding
$a_{i,t}$. This heuristic is actually the truth (given our model) for the
first time-quantum with incomplete information, and can be shown to be a
conservative approximation provided there is less than a period of
incomplete information. By a conservative approximation, we mean that this
heuristic is guaranteed to be more accurate than simply ignoring the
incomplete information. This is because mistaking the retained portion of
prior perturbations for current perturbation leads to its being erroneously
re-multiplied by $(1 - \theta)$, i.e. underestimated.

We can improve this approximation by taking advantage of one feature of
our intended implementation. The implementation we suggest in section 5
uses a randomized style of information spreading known as "rumor mongering"
which spreads each site's load information to an exponentially widening
fraction of the other sites. Thus the amount of load information a site has
drops off exponentially with recency, and only the earliest incomplete load
information is of any real significance.

In particular, for realistic parameters (e.g. a spreading factor of eight)
the only significant improvement that could be made in the above simple
heuristic would be to better account for the deviations observed in the
second incomplete-information time-quantum. Moreover, this division between
the first two incomplete-information time-quanta need not make use of
information from later time-quanta, as such information would be very weak
under these assumptions. This leaves a tractable two-quanta version of the
general problem of the preceding subsection.
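The exponential widening can be illustrated with a rough mean-field sketch in Python. It assumes a simple push model in which every site that already holds an item forwards it to `fanout` freshly chosen random sites each round; this is an assumption made for illustration, not a description of any particular rumor-mongering protocol, and the function name is hypothetical.

```python
def informed_fraction(n_sites, fanout, rounds):
    """Expected fraction of sites holding one load item after each round."""
    fractions = [1.0 / n_sites]  # initially only the originating site knows
    for _ in range(rounds):
        f = fractions[-1]
        informed = f * n_sites
        # Chance that a given uninformed site is missed by every informed site.
        miss = (1.0 - fanout / (n_sites - 1)) ** informed
        fractions.append(1.0 - (1.0 - f) * miss)
    return fractions
```

Under these assumptions, with 1024 sites and a spreading factor of eight, the item reaches under 1% of sites after one round, roughly half after three, and about 99% after four, so only the oldest one or two incomplete quanta carry much information.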

The $a_{i,t}$ from the $N_n$ non-reporting sites of the first quantum can
be lumped together, as can those from the $N_r$ reporting sites of the
second quantum. This is because of the symmetry amongst them. We will call
the contribution of the former to the second-quantum deviations $X$ and
that of the latter $Y$. Our prior distributions for them are independent,
normal, both have mean zero, and (by elementary probability theory) have
the variances

$$\sigma_X^2 = N_n \left( \frac{N_r (1-\theta)}{N} \right)^2 \sigma_a^2$$

$$\sigma_Y^2 = N_r \sigma_a^2.$$

We know that $X$ and $Y$ sum to the observed deviation, $\delta$, of the
second-quantum loads from their step-ahead forecasts. Therefore, the
posterior distribution from Bayes's theorem gives us the following
posterior expected values:

$$E(X) = \frac{\sigma_X^2}{\sigma_X^2 + \sigma_Y^2}\,\delta$$

$$E(Y) = \frac{\sigma_Y^2}{\sigma_X^2 + \sigma_Y^2}\,\delta$$

Thus we can readily at run time use the observed values of $\delta$,
$N_n$, and $N_r$ to calculate a very good approximation to the best
forecast possible with the available information.
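The run-time calculation amounts to dividing the observed deviation in proportion to the two prior variances, which is the standard result for two independent zero-mean normal variables whose sum is known. The Python sketch below takes the variances as inputs rather than computing them; the function name and argument names are hypothetical.

```python
def split_deviation(delta, var_x, var_y):
    """Posterior means of X and Y given X + Y = delta.

    X and Y are assumed independent, normal, and zero-mean with prior
    variances var_x and var_y.
    """
    total = var_x + var_y
    if total <= 0.0:
        raise ValueError("prior variances must not both be zero")
    e_x = delta * var_x / total
    e_y = delta - e_x  # the two posterior means always sum to delta
    return e_x, e_y
```

For instance, `split_deviation(6.0, 1.0, 2.0)` attributes one third of the deviation to the earlier unknown perturbations and two thirds to the current ones.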
4 Precision of Forecasts


In this section we analyze the potential for practical utility of our
load-characterization scheme. We show that for the large numbers of sites
characteristic of massively-parallel architectures, our scheme provides
load estimates which are accurate enough to be useful for load balancing.

We can use the model of section 2 to calculate probability limits of
forecasts, that is, the region around the forecast in which the actual
system-wide load will lie some specified fraction of the time.
Additionally, the more detailed model of section 3 specifies how the
individual sites' loads can be expected to be distributed about the
system-wide average load. What is most interesting is combining these two,
in order to determine

•  what  fraction  of  the  sites  can  be  expected  to  be  over-  or  under-loaded 
at  some  significance  level,  and 

• how much relative error can be expected in the amount of work
transferred between sites, due to erroneous forecasts.

Happily, we show that the accuracy of the forecasts relative to the
standard deviation of the site loads goes up with the square root of the
number of sites, so that for massively-parallel systems the uncertainty in
the forecasts is unproblematic (assuming the validity of the model).

4.1 Probability limits of forecasts

The conditional probability distribution of the system-wide load about its
forecast value is simply the sum of those of the $a_t$ not included in the
forecast. The error in the forecast will thus be normally distributed with
mean zero and variance increasing with lead time. For the IMA
$(0,1,1) \times (0,1,1)_p$ model, if the forecast is made using complete
information only, with lead time $l < p$, the variance is

$$N\sigma_a^2 \left[ 1 + (l-1)(1-\theta)^2 \right].$$

We can use the above formula to calculate approximate probability limits
for the forecasts by substituting an estimate for $\sigma_a$. One approach
would be to estimate it using the sample standard deviation from prior
runs. Prior to the introduction of load balancing, the detailed model of
section 3 certainly doesn't apply, but the system-wide model of section 2
presumably does, at least approximately. Therefore, the sample variance of
the system-wide load should be used as an initial estimate for
$N\sigma_a^2$, rather than starting with the sample variance of individual
site loads.² If the system-wide load sample standard deviation is $s$, then
we can estimate that with probability $\epsilon$ the actual load differs
from the lead $l$ forecast by more than

$$u_{\epsilon/2}\, s \sqrt{1 + (l-1)(1-\theta)^2}$$

where $u_{\epsilon/2}$ is the $\epsilon/2$-tail-area point of the unit
normal distribution. Notice that these bounds are for the total load; the
standard deviation, and hence probability limits, for the average load are
smaller by a factor of $N$.
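A small Python sketch of this calculation follows. It assumes the usual forecast-error variance of the model for lead times up to one period, namely the shock variance multiplied by $1 + (l-1)(1-\theta)^2$; the function name and the numbers in the usage note are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def forecast_limit(s, lead, theta, eps):
    """Deviation of the total load from its forecast exceeded with
    probability eps, for lead times up to one period.

    s is the estimated standard deviation of the system-wide shocks.
    """
    # Point of the unit normal with eps/2 of the area in the upper tail.
    u = NormalDist().inv_cdf(1.0 - eps / 2.0)
    return u * s * sqrt(1.0 + (lead - 1) * (1.0 - theta) ** 2)
```

With `eps=0.05` and a lead of one quantum the limit is about 1.96 times `s`; it widens slowly with lead time when `theta` is close to one. Dividing the result by the number of sites gives the corresponding limit for the average load.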

4.2  Comparison  with  the  distribution  of  site  loads 

Our model asserts that the loads of the individual sites at any time are
normally distributed about the system-wide average load with standard
deviation $\sigma_a$. We can compare this with the standard deviation of
the lead $l$ conditional probability distribution of the average load,
which we derived in the previous subsection. The latter equals the former
multiplied by a factor of

$$\frac{\sqrt{1 + (l-1)(1-\theta)^2}}{\sqrt{N}};$$

the factor of $\sqrt{N}$ results from averaging $N$ independent deviates.

This  implies  that  for  large  systems  the  forecasts  will  be  accurate  enough 
to  be  useful.  For  example,  if  the  system  of  section  2  could  be  spread  among 
1024  sites,  even  one-period-ahead  forecasts  would  have  a  factor  of  27  lower 
standard deviation than the site loads. Thus virtually all apparent over-
or under-loads would be statistically significant, and the relative error
in the amount of work transferred would be small (roughly 1/27).
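The comparison can be reproduced with a few lines of Python. The sketch assumes the forecast-error variance grows as $1 + (l-1)(1-\theta)^2$ for lead times up to one period; the function name is hypothetical, and the value of `theta` in the usage note is an illustrative choice, not the fitted value from our sample series.

```python
from math import sqrt


def precision_ratio(n_sites, lead, theta):
    """Ratio of the site-load standard deviation to the standard deviation
    of the forecast of the average load, for lead times up to one period."""
    return sqrt(n_sites) / sqrt(1.0 + (lead - 1) * (1.0 - theta) ** 2)
```

For 1024 sites the ratio is 32 at a lead of one quantum. At a lead of one ten-quantum period, an illustrative `theta` of 0.8 gives a ratio between 27 and 28, the same order as the factor quoted above.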

5  Load-balancing  Mechanism 

In this section we outline a load-balancing scheme employing the
load-characterization methodology of the preceding sections. Our scheme
relies on a "rumor mongering" style of information spreading [9], which is
appropriate to our architecture. We show that the mechanism not only allows
sites to assess their load with respect to the system-wide average, but
also allows overloaded sites to reliably find sufficiently underloaded
sites to which objects can be migrated.

² We only wrote the formula in terms of the per-site $\sigma_a$ in order to
be notationally consistent with section 3.

If each site stores its knowledge of all sites' load histories, then they
can spread their information around by a process of "rumor mongering", that
is, by randomly sharing information [10,1,2,9]. Naturally, the histories
can be compressed by discarding information old enough to be scarcely
relevant and by combining together loads from all sites where they all are
known. Some information may be young enough to be relevant to forecasting,
but old enough to be well-known. This information can be retained but not
passed on; [9] has a good discussion of such issues.

Our CARE ensemble architecture [8] uses a cut-through interconnection
network, so latency is not proportional to distance (in the absence of
contention). Additionally, it supports an efficient multicast protocol [5].
Therefore, we suggest that the information spreading be achieved by each
site periodically multicasting its information to a random sample of the
other sites. While the number of sites that each site will hear from in any
given period varies, it can be shown that the distribution (a binomial
distribution, rapidly approaching a Poisson distribution) is such that a
paucity of information will be rare, even with a quite moderate sample
size, e.g. eight.
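The claim about the distribution can be checked numerically. The Python sketch below assumes that each of the other sites independently includes a given site in its random sample with probability equal to the sample size divided by the number of other sites; the function names are hypothetical.

```python
from math import comb, exp, factorial


def prob_hear_at_most(k, n_sites, sample_size):
    """Binomial probability that a site hears from at most k senders."""
    m = n_sites - 1                 # number of potential senders
    p = sample_size / m             # chance a given sender picks this site
    return sum(comb(m, i) * p ** i * (1 - p) ** (m - i) for i in range(k + 1))


def poisson_at_most(k, mean):
    """Poisson probability of at most k events, for comparison."""
    return sum(exp(-mean) * mean ** i / factorial(i) for i in range(k + 1))
```

For 1024 sites and a sample size of eight, the probability of hearing from two or fewer senders in a period is under 2%, and the binomial and Poisson values agree to within about one part in a thousand.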

Upon  receiving  a  load-information  message,  a  site  should  integrate  the 
information into its own knowledge, and then use the time-series model
(provided a priori based on experiments with the particular system) to
estimate
the  current  system-wide  average  load  with  probability  limits.  It  should  then 
compare  this  predicted  average  with  its  own  current  load,  and  with  the  load 
of  the  sender  at  the  time  of  the  sending.  If  the  recipient  appears  significantly 
underloaded  and  the  sender  appears  significantly  overloaded,  a  request  for 
work  should  be  sent  back. 
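The decision made on receipt of a message can be sketched as follows in Python. Here "significantly" is interpreted as lying outside the probability limit around the predicted average, and the amount requested is taken as the smaller of the two imbalances; both are assumptions of this sketch rather than parts of the mechanism as specified, and all names are hypothetical.

```python
def should_request_work(own_load, sender_load, forecast_avg, limit):
    """True if the recipient looks significantly underloaded and the sender
    looks significantly overloaded, relative to the predicted average."""
    recipient_under = own_load < forecast_avg - limit
    sender_over = sender_load > forecast_avg + limit
    return recipient_under and sender_over


def work_to_request(own_load, sender_load, forecast_avg):
    """Assumed policy: ask for no more than either site's imbalance."""
    deficit = forecast_avg - own_load
    surplus = sender_load - forecast_avg
    return max(0.0, min(deficit, surplus))
```

For example, with a predicted average of 10 and a limit of 1, a recipient at load 6 hearing from a sender at load 13 would request 3 units of work, leaving the sender at the average.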

This  is  a  combination  of  random  gossiping  to  distribute  the  information 
needed  to  decide  whether  and  how  much  work  to  transfer,  together  with 
polling/bidding  to  match  up  the  participating  sites.  As  with  all  bidding 
schemes,  some  precautions  are  needed  to  avoid  races.  The  underloaded 
site  should  not  place  any  other  requests  for  work  until  it  receives  work  or 
an  apology  from  the  overloaded  site.  As  the  inter-arrival  time  for  messages 
from  overloaded  sites  should  be  high  relative  to  the  round-trip  message  time, 
few  conflicts  should  occur. 

The bidding could be reversed (overloaded sites could ask underloaded
sites to accept work), but this would require that an extra message be
sent. The system as we present it can best be classified as
receiver-initiated [11], though in a sense the sender initiates the process
by multicasting its load information. This confusion of terminology results
from our integration of the global-information-spreading and
partner-seeking components of the mechanism.

It should be rare that an overloaded site cannot find enough total
underload among the sites it samples to match its own overload. For
example, suppose that the loads are normally distributed (as they are in
the model of section 3), and that the sample size is eight. Of the eight
sites sampled, it can be expected that four will be underloaded. The
expected value of the absolute value of a normal deviate is
$\sqrt{2/\pi}$, or about 0.8 standard deviations, so the four underloaded
sites will on the average have approximately 3.2 standard deviations worth
of underload. But the originating site must really be far out on the tail
of the distribution to have more than 3.2 standard deviations worth of
overload. Notice that it is impossible to make as strong a statement in the
reverse direction; this is an additional reason to favor a
receiver-initiated transfer (it is more important for overloaded sites to
reliably find underloaded sites than the converse).
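The arithmetic in this example can be verified with the Python standard library. The sketch assumes unit-normal site loads measured in standard deviations from the average and that half of the sampled sites are underloaded; the function names are hypothetical.

```python
from math import pi, sqrt
from statistics import NormalDist


def expected_underload(sample_size):
    """Expected total underload, in standard deviations, among the
    underloaded half of a sample of normally distributed site loads."""
    mean_abs_deviate = sqrt(2.0 / pi)  # E|Z| for a unit normal, about 0.8
    return (sample_size / 2) * mean_abs_deviate


def overload_tail_probability(overload):
    """Chance that a site's overload exceeds the given number of
    standard deviations."""
    return 1.0 - NormalDist().cdf(overload)
```

With a sample size of eight the expected underload is about 3.19 standard deviations, and fewer than one site in a thousand would be expected to carry more overload than that.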

The only aspect of load balancing not addressed by this mechanism is the
choice of which objects to migrate. Here again the real-time nature of the
system must be addressed. In general neither the highest- nor
lowest-priority objects are best migrated, so as to neither unfairly
advance a low-priority object nor hold up (due to migration time) a
high-priority object. Chang addresses these issues in [6].